Skip to content

[BugFix] Fix NOT IN including null/missing rows (#5165)#5337

Merged
qianheng-aws merged 1 commit into
opensearch-project:mainfrom
songkant-aws:fix/not-in-null-filter-5165
Apr 14, 2026
Merged

[BugFix] Fix NOT IN including null/missing rows (#5165)#5337
qianheng-aws merged 1 commit into
opensearch-project:mainfrom
songkant-aws:fix/not-in-null-filter-5165

Conversation

@songkant-aws

Copy link
Copy Markdown
Collaborator

Description

When PredicateAnalyzer handles a SEARCH expression with complemented points (NOT IN) and nullAs=UNKNOWN, the generated OpenSearch DSL query was missing an exists filter. This caused documents with null/missing field values to be incorrectly included in NOT IN results.

Root cause: In the case UNKNOWN branch of the SEARCH handler (~line 728), the expression was returned as-is without adding an exists check for complemented points. SQL three-valued logic dictates that NULL NOT IN (...) evaluates to UNKNOWN (not TRUE), so null rows must be excluded.

Fix: Check isSearchWithComplementedPoints(call) in the UNKNOWN branch and wrap with AND exists filter, consistent with the existing FALSE branch behavior.

Related Issues

Resolves #5165

Check List

  • New functionality includes testing
  • Commits signed per DCO (-s)
  • spotlessCheck passed
  • Unit tests passed
  • Integration tests passed

… filter (opensearch-project#5165)

When PredicateAnalyzer handles a SEARCH expression with complemented points
(NOT IN) and nullAs=UNKNOWN, the generated DSL query now includes an exists
filter to exclude null/missing documents. This aligns with SQL three-valued
logic where NULL NOT IN (...) evaluates to UNKNOWN, not TRUE.

Signed-off-by: Songkan Tang <songkant@amazon.com>
@songkant-aws

Copy link
Copy Markdown
Collaborator Author

Decision Log

Root Cause: PredicateAnalyzer.java line 728 — the case UNKNOWN branch of the SEARCH handler returned the expression without an exists filter. For NOT IN (complemented points), this meant the generated DSL was must_not(termsQuery) without exists, so documents with null/missing fields passed through.

Approach: Added isSearchWithComplementedPoints(call) check in the UNKNOWN branch, wrapping with AND exists filter when true. This mirrors the existing correct behavior in the FALSE branch (lines 718-722). The fix is minimal — 7 lines changed in production code.

Alternatives Rejected:

  • Modifying Calcite's nullAs generation to always produce FALSE for NOT IN — too invasive, changes Calcite semantics upstream.
  • Adding a post-processing filter in the execution layer — would defeat pushdown optimization.

Pitfalls: The UNKNOWN branch also handles non-complemented cases (positive IN, ranges). Those must NOT get an exists filter added since positive matches naturally exclude nulls. The isSearchWithComplementedPoints check ensures only NOT IN gets the fix.

Things to Watch: Any future Sarg patterns with nullAs=UNKNOWN and negation semantics should be reviewed for the same null-handling gap.

@github-actions

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Fix NOT IN null exclusion in PredicateAnalyzer with unit tests

Relevant files:

  • opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java
  • opensearch/src/test/java/org/opensearch/sql/opensearch/request/PredicateAnalyzerTest.java

Sub-PR theme: Add integration tests for NOT IN null filter fix

Relevant files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteNotInNullFilterIT.java
  • integ-test/src/yamlRestTest/resources/rest-api-spec/test/issues/5165.yml

⚡ Recommended focus areas for review

Incomplete Fix

The fix only adds the exists filter when isSearchWithComplementedPoints(call) is true in the UNKNOWN branch. However, it's worth verifying whether there are other SEARCH expressions with nullAs=UNKNOWN that are NOT complemented points but might also need special handling. The original code returned expression as-is for all UNKNOWN cases; now only the complemented-points sub-case is changed. Confirm that non-complemented UNKNOWN SEARCH expressions (e.g., regular IN with UNKNOWN) are still handled correctly and don't need an exists filter.

case UNKNOWN ->
    isSearchWithComplementedPoints(call)
        ? CompoundQueryExpression.and(
            false, expression, QueryExpression.create(pair.getKey()).exists())
        : expression;
Missing Field Test

The YAML test only verifies int_field: null (explicit null), but the PR description mentions "null/missing" rows. A document with a completely missing int_field (no field at all in the document body) should also be excluded by the NOT IN filter. Consider adding a document without int_field to verify that missing-field documents are also excluded.

- '{"index": {"_index": "issue5165", "_id": "1"}}'
- '{"int_field": 42}'
- '{"index": {"_index": "issue5165", "_id": "2"}}'
- '{"int_field": -1}'
- '{"index": {"_index": "issue5165", "_id": "3"}}'
- '{"int_field": 0}'
- '{"index": {"_index": "issue5165", "_id": "4"}}'
- '{"int_field": 2147483647}'
- '{"index": {"_index": "issue5165", "_id": "5"}}'
- '{"int_field": null}'
Settings API Usage

The query.settings API used to enable/disable Calcite (plugins.calcite.enabled) as a transient setting may not be the standard approach for test setup/teardown in this codebase. Verify this is the correct API and that the setting is reliably reset in teardown even if the test fails.

setup:
  - do:
      query.settings:
        body:
          transient:
            plugins.calcite.enabled: true

@github-actions

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Verify correct call reference in null-handling logic

The isSearchWithComplementedPoints(call) check is applied to the outer call
parameter of the binary method, but at this point in the switch expression, call
refers to the original binary RexCall, not necessarily the SEARCH call being
analyzed. Verify that call here correctly refers to the SEARCH call with
complemented points, otherwise the exists filter may not be applied when expected,
or may be incorrectly applied to non-NOT-IN cases.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java [731-735]

+case UNKNOWN ->
+    isSearchWithComplementedPoints(call)
+        ? CompoundQueryExpression.and(
+            false, expression, QueryExpression.create(pair.getKey()).exists())
+        : expression;
 
-
Suggestion importance[1-10]: 3

__

Why: The suggestion asks to verify that call refers to the correct SEARCH call, but the existing_code and improved_code are identical, meaning no actual change is proposed. This is a verification request rather than a concrete fix, and the surrounding context suggests call is indeed the correct reference passed to binary().

Low
General
Use standard cluster settings API in test setup

Using transient cluster settings means the setting will be lost after a cluster
restart, but more critically, if a previous test run fails before teardown, the
plugins.calcite.enabled: true setting may persist and affect other tests. Consider
whether a persistent setting or a different mechanism is more appropriate, and
ensure the teardown reliably resets the setting even on test failure.

integ-test/src/yamlRestTest/resources/rest-api-spec/test/issues/5165.yml [1-6]

 setup:
   - do:
-      query.settings:
+      cluster.put_settings:
         body:
           transient:
             plugins.calcite.enabled: true
Suggestion importance[1-10]: 3

__

Why: The suggestion to use cluster.put_settings instead of query.settings raises a valid concern about test isolation, but the teardown already resets the setting and query.settings appears to be a custom API used consistently in this codebase. The impact is minor.

Low
Clarify test comment to match sorted assertion order

The comment states age values are 32, 36, 28, 33, 36, null, 34 and NOT IN (32, 28)
should return 36, 33, 36, 34. After sorting ascending, the expected order should be
33, 34, 36, 36, which matches the assertion. However, if the test data in the index
has changed or differs from the comment, the assertion could silently pass with
wrong data. Ensure the comment and assertion are kept in sync with the actual test
index data.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteNotInNullFilterIT.java [36]

+// age values in BANK_WITH_NULL_VALUES: 32, 36, 28, 33, 36, null, 34
+// NOT IN (32, 28) excludes 32, 28, and null → remaining sorted: 33, 34, 36, 36
 verifyDataRowsInOrder(result, rows(33), rows(34), rows(36), rows(36));
Suggestion importance[1-10]: 1

__

Why: This suggestion only proposes adding a more detailed comment to the test, which is a documentation-only change with minimal functional impact. The existing comment already conveys the intent adequately.

Low

@qianheng-aws qianheng-aws merged commit c26897d into opensearch-project:main Apr 14, 2026
58 of 61 checks passed
@songkant-aws songkant-aws deleted the fix/not-in-null-filter-5165 branch May 15, 2026 06:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] NOT IN includes null/missing rows due to missing exists filter in pushdown

3 participants